Motivated by the challenges associated with accounting for the ascertainment when 
analyzing secondary phenotypes that are correlated with case-control status, Lin and Zeng 
have proposed a method that properly reflects the case-control sampling (Lin and Zeng, 
2009). The Lin and Zeng method has the advantage of accurately estimating effect sizes 
for secondary phenotypes that are normally distributed or dichotomous. In practice, however, 
the method can be computationally intensive under the null hypothesis, where the likelihood 
surface to be maximized can be relatively flat. We propose an extension of the 
Lin and Zeng method for hypothesis testing that uses proportional odds logistic regression 
to circumvent these computational issues. Through simulation studies, we compare the 
power and type-1 error rate of our method to standard approaches and Lin and Zeng's 
approach. 
INTRODUCTION 

For the analysis of secondary phenotype data collected in a 
case-control study, Lin and Zeng have proposed a method that 
properly reflects the case-control sampling (Lin and Zeng, 2009). 
This work is motivated by the challenges associated with account- 
ing for the ascertainment when analyzing secondary phenotypes 
that are correlated with case-control status. Several methods 
have been proposed that accurately estimate the odds ratio of 
genetic variants for binary secondary phenotypes associated with 
case-control status, but most of these methods do not read- 
ily accommodate continuous secondary phenotypes (Greenland, 
2003; Kraft, 2007; Richardson et al., 2007; Monsees et al., 2009; 
Li et al., 2010; Wang and Shete, 2011a,b; He et al., 2012; Li and 
Gail, 2012). While two of these methods use an inverse probability 
weighted (IPW) regression approach that can accommodate 
continuous secondary phenotypes, they focus on correcting 
the bias in the estimator due to the ascertainment 
conditions and rely on a known disease rate (Richardson et al., 
2007; Monsees et al., 2009). Since this paper focuses on hypothesis 
testing rather than estimation of disease-association parameters, with 
an equal number of cases and controls, we do not present these 
methods here. 

Alternatively, the Lin and Zeng method has the advantage 
of accurately estimating effect sizes for secondary phenotypes 
that are normally distributed or dichotomous (Lin and Zeng, 
2009). Under the null hypothesis, however, the likelihood surface 
that needs to be maximized can be relatively flat, which makes the 
method computationally intensive in practice. To circumvent these 
computational issues, we propose an extension of the Lin and 
Zeng method for hypothesis testing that uses proportional odds 
logistic regression. Since the approach by Lin and Zeng has the 
advantage that effect sizes can also be estimated, we recommend 
the following workflow for the analysis of continuous secondary 
phenotypes. 

1. Test all SNPs with our approach using proportional odds 
logistic regression since the vast majority of SNPs will be under the 
null hypothesis. 

2. For the significant SNPs, apply Lin and Zeng's method to 
obtain parameter estimates and confidence intervals. 

This proposed approach circumvents the computational issues 
encountered in the Lin and Zeng approach under the null hypothesis, 
but uses Lin and Zeng's method to accurately estimate 
effect sizes for the significant SNPs found in Step 1. Through simulation 
studies, we compare the power and type-1 error rate of our 
method to standard approaches and Lin and Zeng's approach. 

METHODS 

When the secondary phenotype is normally distributed, Lin and 
Zeng propose an adjusted score test that incorporates genetic 
associations with affection status into the test statistic and models 
the likelihood function as follows (Lin and Zeng, 2009): 
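The form of Equation (1) can be sketched from the definitions that follow it. The expression below is an assumption based on the retrospective sampling (conditioning on D) described in the text and on the expression given for $P(D_i = 1)$; it is not a quotation of Lin and Zeng (2009), and the symbol $\theta$ for the full parameter vector is notation introduced here.

```latex
L(\theta) = \prod_{i=1}^{n} P(Y_i, X_i \mid D_i)
          = \prod_{i=1}^{n}
            \frac{P(D_i \mid X_i, Y_i)\, P(Y_i \mid X_i)\, P(X_i)}{P(D_i)}
\tag{1}
```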
where D denotes the case-control status (1 = case and 0 = control), 
Y denotes the secondary phenotype, n denotes the total 
number of subjects, and X denotes the genotype of interest. 
Lin and Zeng calculate 
$P(D_i = 1) = \sum_x \sum_y P(D_i = 1 \mid x, y) P(y \mid x) P(x)$. 
The probability $P(D \mid X, Y)$ is defined by a 
logistic regression model. They model $P(Y \mid X)$ as a logistic 
regression for dichotomous Y or a linear regression for normally 
distributed Y. They maximize the likelihood with respect to P(X) 
by the Newton-Raphson algorithm. In this framework, likelihood-based 
statistics (i.e., Wald, score, and likelihood-ratio statistics) 
can be used to make inference. 

The Lin and Zeng approach requires a quantitative secondary phenotype 
to be normally distributed, and the method can be problematic 
under the null hypothesis since the likelihood surface that 
needs to be maximized can be relatively flat. Because Lin and Zeng's 
method estimates the parameters in the model by maximizing 
the likelihood given in Equation (1), the approach is computationally 
demanding when testing a large number of SNPs, most of which 
are under the null hypothesis. The difficulty arises because 
the ascertainment condition can flatten the likelihood surface 
under the null hypothesis, which makes the maximization hard. 

If the primary goal of the secondary phenotype analysis is 
hypothesis testing as opposed to estimation of disease-association 
parameters, an alternative approach is to use a different decomposition 
of the likelihood, one that does not require maximizing 
a relatively flat likelihood surface. For the association 
testing of secondary phenotypes in case-control studies, we therefore 
propose a simpler breakdown of the likelihood that requires 
few assumptions. 
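One way to write this breakdown, reconstructed from the surrounding description rather than quoted from the source, is sketched here. The subject index $i$ and the symbol $\Lambda$ for the likelihood ratio are notation assumed for this sketch.

```latex
% Factor the retrospective likelihood of genotype and phenotype given disease status
P(X_i, Y_i \mid D_i) = P(X_i \mid Y_i, D_i)\, P(Y_i \mid D_i)

% Under H_0 (X independent of Y given D), P(X_i \mid Y_i, D_i) = P(X_i \mid D_i),
% so the factor P(Y_i \mid D_i) cancels from the ratio
\Lambda = \frac{\prod_{i=1}^{n} P(X_i \mid D_i)\, P(Y_i \mid D_i)}
               {\prod_{i=1}^{n} P(X_i \mid Y_i, D_i)\, P(Y_i \mid D_i)}
        = \prod_{i=1}^{n} \frac{P(X_i \mid D_i)}{P(X_i \mid Y_i, D_i)}
```

With fitted models for the two conditional distributions of X, $-2 \log \Lambda$ is the usual likelihood ratio statistic, and no model for $P(Y \mid D)$ is needed.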
Under the null hypothesis, X is independent of Y given D and any 
confounders, so $P(X \mid Y, D) = P(X \mid D)$. The likelihood ratio test 
then compares $P(X \mid Y, D)$ with $P(X \mid D)$, because the factor 
$P(Y \mid D)$ is common to both hypotheses and cancels. 
As a result, one only needs to model $P(X \mid D)$ and $P(X \mid Y, D)$. For 
an additive genetic model, i.e., X = 0, 1, 2, corresponding to allele 
counts, instead of modeling the likelihood function directly, one can use 
a cumulative logistic regression model with proportional odds 
for $P(X \mid D)$ and $P(X \mid Y, D)$ such that 
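
$$\begin{aligned}
\operatorname{logit}[P(X \le j \mid Y, D)] &= \alpha_{1j} + \delta_{1Y} Y + \delta_{1D} D \\
\operatorname{logit}[P(X \le j \mid D)] &= \alpha_{0j} + \delta_{0D} D
\end{aligned} \qquad (4)$$
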
for j = 0, 1. To control for any known confounders, these covariates 
can be added to Equation (4). This model assumes the same 
effect for different cumulative logits (Agresti, 2002). If these 
assumptions are not met, we recommend either a link function for which 
the response curve is non-symmetric or the addition of a dispersion 
parameter. For imputed dosages, j runs up to the number of dosage 
levels minus one, meaning that the number of cumulative logits in the 
regression increases to the number of dosage levels 
minus one. 
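The test built on Equation (4) can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes genotypes coded 0, 1, 2, no additional confounders, a hand-written proportional odds likelihood maximized with SciPy's BFGS optimizer, and an asymptotic chi-square reference distribution with one degree of freedom. The function names `neg_loglik`, `max_loglik`, and `secondary_phenotype_lrt` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import chi2


def neg_loglik(params, x, covariates):
    """Negative log-likelihood of a proportional odds model for x in {0, 1, 2}."""
    a0 = params[0]
    a1 = a0 + np.exp(params[1])        # reparameterized so the cutpoints stay ordered
    eta = covariates @ params[2:]      # shared slope terms (proportional odds)
    c0 = expit(a0 + eta)               # P(X <= 0 | covariates)
    c1 = expit(a1 + eta)               # P(X <= 1 | covariates)
    probs = np.column_stack([c0, c1 - c0, 1.0 - c1])
    chosen = probs[np.arange(len(x)), x]
    return -np.sum(np.log(np.clip(chosen, 1e-300, None)))


def max_loglik(x, covariates):
    """Maximized log-likelihood for the given covariate set."""
    start = np.zeros(2 + covariates.shape[1])
    fit = minimize(neg_loglik, start, args=(x, covariates), method="BFGS")
    return -fit.fun


def secondary_phenotype_lrt(x, y, d):
    """Likelihood ratio test of P(X | Y, D) against P(X | D)."""
    x = np.asarray(x, dtype=int)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    full = max_loglik(x, np.column_stack([y, d]))   # model with Y and D
    null = max_loglik(x, d.reshape(-1, 1))          # model with D only
    stat = max(2.0 * (full - null), 0.0)            # guard against optimizer noise
    return stat, chi2.sf(stat, df=1)
```

Calling `secondary_phenotype_lrt(x, y, d)` on arrays of genotypes, phenotype values, and case-control indicators returns the test statistic and its p-value. Known confounders could be handled by appending the same columns to both covariate matrices.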

SIMULATIONS 

To assess the performance of this approach and compare it to Lin 
and Zeng's method, we conducted simulation studies following 
Lin and Zeng's manuscript with a minor allele frequency (MAF) of 0.3, 
an additive mode of inheritance, and a significance level of 
$\alpha = 0.01$ (Lin and Zeng, 2009). We also compared both of these 
methods to the standard case-only method, control-only method, and 
combined case and control method, in which both cases and controls 
are included in the analysis. The secondary quantitative trait Y and 
the disease D were generated from the models 
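
$$Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2) \qquad (5)$$

$$P(D = 1 \mid X, Y) = \frac{\exp(\gamma_0 + \gamma_1 X + \gamma_2 Y)}{1 + \exp(\gamma_0 + \gamma_1 X + \gamma_2 Y)} \qquad (6)$$
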

where $\beta_0 = \sigma = 1$, $\beta_1 = 0$ under the null hypothesis and 
$\beta_1 = -0.12$ under the alternative hypothesis. We let $\gamma_2 = \log(2)$, 
$\gamma_1$ varies from 0 to $\log(1.5)$, and $\gamma_0$ was chosen such that the disease 
rate is 1% or 5%. For each combination of simulation parameters, 
we generated 1000 data sets with 500 cases and 500 controls. 
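The data-generating scheme in Equations (5) and (6) can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the reported simulations: genotypes are drawn as Binomial(2, MAF), the intercept of the disease model is solved numerically on a large Monte Carlo sample so that the population disease rate matches the target, and cases and controls are collected by repeatedly sampling the population. The function names and the `batch` and `size` arguments are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit


def draw_population(rng, size, beta1, maf=0.3, beta0=1.0, sigma=1.0):
    """Draw additive genotypes X and the secondary trait Y | X (Equation 5)."""
    x = rng.binomial(2, maf, size=size)
    y = rng.normal(beta0 + beta1 * x, sigma)
    return x, y


def solve_gamma0(prevalence, gamma1, gamma2, beta1, rng, size=200_000):
    """Find the intercept of Equation (6) that gives the target disease rate."""
    x, y = draw_population(rng, size, beta1)

    def gap(g0):
        return expit(g0 + gamma1 * x + gamma2 * y).mean() - prevalence

    return brentq(gap, -20.0, 20.0)


def simulate_case_control(rng, gamma0, gamma1, gamma2, beta1,
                          n_cases=500, n_controls=500, batch=50_000):
    """Sample the population until enough cases and controls are collected."""
    xs, ys, ds = [], [], []
    need = {1: n_cases, 0: n_controls}
    while need[0] > 0 or need[1] > 0:
        x, y = draw_population(rng, batch, beta1)
        d = rng.binomial(1, expit(gamma0 + gamma1 * x + gamma2 * y))
        for status in (0, 1):
            # keep only as many subjects of this status as are still needed
            idx = np.flatnonzero(d == status)[:need[status]]
            xs.append(x[idx])
            ys.append(y[idx])
            ds.append(d[idx])
            need[status] -= idx.size
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(ds)
```

For example, a null data set with a 5% disease rate would use `beta1=0.0`, `gamma1=np.log(1.5)`, `gamma2=np.log(2.0)`, and the intercept returned by `solve_gamma0(0.05, ...)`.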

Figure 1 shows the type 1 error rates and power for disease 
rates of 1% and 5%. Our method, using proportional odds 
logistic regression, maintains the type 1 error rate and has slightly 
higher power than Lin and Zeng's method and superior 
power compared to the other methods. While the proposed 
method and Lin and Zeng's method have similar power, the proposed 
method is computationally more feasible under the null 
hypothesis than Lin and Zeng's method since it does not involve 
maximizing a relatively flat likelihood surface. The computing 
time for the proposed approach is under 1 s per SNP, whereas the 
software associated with the Lin and Zeng approach needs to be 
run multiple times if there are issues with convergence, which can 
take 5 min to an hour per SNP. When running a GWAS with about 
500,000 SNPs, this difference in computing time per SNP can be 
substantial. To examine this issue further, the plot on the left 
in Figure 2 shows the log likelihood specified by Lin and Zeng 
for varying values of $\beta_0$ and $\beta_1$ with all other parameters fixed at 
their true values, for data generated under the null hypothesis 
with $\gamma_1 = \log(1.5)$ and a disease rate of 5%. The plot 
on the right shows the log likelihood specified by Lin and Zeng for 
varying values of $\gamma_1$ and $\gamma_2$ with all other parameters fixed at their 
true values, for data generated under the same conditions. The red 
dots on the plots represent the true maximum. The surface for 
$\beta_0$ and $\beta_1$ has a clear maximum whereas the surface for 
$\gamma_1$ and $\gamma_2$ is relatively flat, 
demonstrating the difficulty in maximizing the likelihood surface 
defined by Lin and Zeng under the null hypothesis. 

FIGURE 1 | Type 1 error rates and power for a disease rate of 1% 
and 5%. As seen in the plots, the new method using 
proportional odds logistic regression maintains the type 1 error rate. 
The new method has similar power compared to Lin and Zeng's 
method, called SPREG, and superior power compared to the other 
methods. 
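To put the per-SNP computing times quoted above in perspective, the totals for a scan of 500,000 SNPs can be worked out directly. The figures below assume serial execution on a single processor and assume that every SNP takes the quoted time (1 s, 5 min, or 1 h). Both are simplifying assumptions, since the longer times are reported only for SNPs with convergence issues.

```latex
\begin{align*}
\text{proposed approach (at most):} \quad
  & 500{,}000 \times 1\ \text{s} = 5 \times 10^{5}\ \text{s}
    \approx 139\ \text{h} \approx 5.8\ \text{days} \\
\text{Lin and Zeng, 5 min per SNP:} \quad
  & 500{,}000 \times 300\ \text{s} = 1.5 \times 10^{8}\ \text{s}
    \approx 4.8\ \text{years} \\
\text{Lin and Zeng, 1 h per SNP:} \quad
  & 500{,}000 \times 3600\ \text{s} = 1.8 \times 10^{9}\ \text{s}
    \approx 57\ \text{years}
\end{align*}
```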

DISCUSSION 

While the power of the proposed method is comparable to the 
method of Lin and Zeng, the proposed approach avoids the 
computationally intensive task of maximizing a flat likelihood 
surface under the null hypothesis. Since the proposed 
approach is limited in its ability to accurately estimate 
effect sizes while the approach by Lin and Zeng has the advantage 
that effect sizes can be accurately estimated, we recommend 
the following workflow for the analysis of secondary phenotypes. 



1. Test all SNPs with the proposed approach using proportional 
odds logistic regression since the vast majority of SNPs will be 
under the null hypothesis. 

2. For the significant SNPs, apply Lin and Zeng's method to 
obtain parameter estimates and confidence intervals. 

By using our approach to test all the SNPs in the GWAS, the 
hypothesis testing can be done quickly and efficiently since our 
approach does not require maximizing a flat 
likelihood surface under the null hypothesis. By obtaining parameter 
estimates for only the significant SNPs with Lin and Zeng's 
method, one can take the care needed to ensure that the likelihood 
is properly maximized, an effort that is too computationally 
demanding to apply to the entire GWAS. 
FIGURE 2 | Log likelihood surface specified by Lin and Zeng. The plot on 
the left is the log likelihood specified by Lin and Zeng for varying values of 
$\beta_0$ and $\beta_1$ with all other parameters fixed at their true values and for data 
generated under the null hypothesis with $\gamma_1 = \log(1.5)$ and the disease rate 
equal to 5%. The plot on the right is the log likelihood specified by Lin and Zeng 
for varying values of $\gamma_1$ and $\gamma_2$ with all other parameters fixed at their true 
values and for data generated under the null hypothesis with $\gamma_1 = \log(1.5)$ 
and the disease rate equal to 5%. The red dots on the plots represent the true 
maximum. The surface for $\beta_0$ and $\beta_1$ has a clear maximum whereas the 
surface for $\gamma_1$ and $\gamma_2$ is relatively flat, demonstrating the difficulty in 
maximizing the likelihood surface defined by Lin and Zeng under the null 
hypothesis. 



There are potential limitations associated with this strategy of 
combining two methodological approaches to reduce the computational 
burden while still being able to estimate the parameters 
of interest. While the two approaches have comparable power, a 
relatively small number of SNPs that are significant with the new 
approach may not be significant with Lin and Zeng's method, 
and vice versa. Also, both approaches may have issues if the 
case-control status is very strongly correlated with the secondary 
phenotype. In this case, the secondary phenotype does not provide 
new information beyond the case-control status, and these 
methods for testing secondary phenotypes in case-control genetic 
association studies are not applicable. 